Do speech behaviours related to confidence and uncertainty vary between men and women?
Among all species on Earth, humans are unique in communicating through a symbolic system, i.e., spoken and written language. This highly sophisticated system enables humans to communicate in a very precise and complex manner. Still, communicative speech acts seem to differ between genders. One of the major differences between women’s and men’s speech is that men have been found [1,2] to dominate conversations through the use of interruptions and overlaps. Additionally, men tend to use strong expletives, while women tend to use politer versions.
In this project, we investigate how the use of language varies with gender and with the social norms attached to it. We hypothesise that men and women have different speech behaviours, with women expressing more uncertainty (doubt). For example, we expect a woman to say “I expect this to do that” while a man would rather say “I know this does that”. Our aim is therefore to analyse whether there is a real difference between genders and, if so, how large it is.
We are interested in using this dataset to answer the following question:
To answer this question, we'll go through the following points:
In the following, we analyse data from Quotebank, an open corpus of 178 million speaker-attributed quotations from 2008 to 2020. In this project, we focus only on the most recent quotations, from 2015 to 2020. We combine this dataset with speakers’ information from Wikidata, a collaboratively edited open-source knowledge base.
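As a rough illustration of this step, the sketch below restricts a few toy quotation records to 2015–2020 and attaches speaker attributes looked up by Wikidata identifier. The field names (`quotation`, `qids`, `date`), the date format, the `speakers` lookup table and the helper `attach_speaker_info` are all assumptions made for the example; they are not taken from the project's actual pipeline.

```python
from datetime import datetime

# Toy records. Field names and the date format are assumptions for this sketch.
quotes = [
    {"quotation": "I think this might work.", "qids": ["Q1"], "date": "2016-03-01 00:00:00"},
    {"quotation": "We know this works.", "qids": ["Q2"], "date": "2019-07-15 00:00:00"},
    {"quotation": "An older quotation.", "qids": ["Q1"], "date": "2010-01-01 00:00:00"},
]

# Hypothetical speaker table keyed by Wikidata identifier.
speakers = {
    "Q1": {"gender": "female", "occupation": "physicist"},
    "Q2": {"gender": "male", "occupation": "politician"},
}


def attach_speaker_info(quotes, speakers, first_year=2015, last_year=2020):
    """Keep quotations within the year range and add the speaker's attributes."""
    merged = []
    for quote in quotes:
        year = datetime.strptime(quote["date"], "%Y-%m-%d %H:%M:%S").year
        if not first_year <= year <= last_year:
            continue  # outside the period of interest
        if not quote["qids"]:
            continue  # no speaker identifier to look up
        # Simplification: use only the first candidate identifier.
        info = speakers.get(quote["qids"][0])
        if info is None:
            continue  # speaker not present in the lookup table
        merged.append({**quote, "year": year, **info})
    return merged


merged = attach_speaker_info(quotes, speakers)
for row in merged:
    print(row["year"], row["gender"], row["quotation"])
```

On the real corpus the same filter-then-join logic would more likely be run chunk by chunk with a data frame library, since 178 million quotations do not fit comfortably in a list of dictionaries.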

To get a general overview of the speakers’ occupations, we focus on four main professional fields: arts, science, economy and politics. Speakers are grouped by profession within each of these fields. Then, to assess the roles of nationality, religion and education in a possible cultural gender difference in communicative acts, we use a general data frame with no condition on profession.
To analyse speech uncertainty, we adapted an existing uncertainty detection classifier [3], using 6 features. Uncertainty is signalled by speculative verbs (e.g., suggest, presume), adjectives and adverbs (e.g., probably, possibly), auxiliary verbs (e.g., must, should) or the use of certain tenses or moods (subjunctive, conditional). This classifier is an automatic machine learning method for detecting uncertainty in natural language.
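To make the cue categories above concrete, here is a minimal lexicon-lookup sketch. It is not the classifier used in this project, which is a trained machine learning model; it only counts a few hand-picked cue words per category. The word lists and the helper names `count_cues` and `is_uncertain` are hypothetical, and tense or mood cues are not covered.

```python
import re

# Hypothetical, deliberately tiny cue lists; a real lexicon would be far larger.
UNCERTAINTY_CUES = {
    "speculative_verb": {"suggest", "presume", "suppose", "expect"},
    "adjective_adverb": {"probably", "possibly", "likely", "perhaps"},
    "auxiliary": {"must", "should", "might", "may", "could"},
}


def count_cues(quotation):
    """Count exact-match cue words per category (inflected forms are missed)."""
    tokens = re.findall(r"[a-z']+", quotation.lower())
    return {
        category: sum(token in cues for token in tokens)
        for category, cues in UNCERTAINTY_CUES.items()
    }


def is_uncertain(quotation):
    """Flag a quotation as uncertain if it contains at least one cue word."""
    return any(count_cues(quotation).values())


for text in ["I expect this should probably work", "I know this does that"]:
    print(text, "->", count_cues(text), is_uncertain(text))
```

A plain lookup like this ignores context (for example, "may" as a month, or "must" expressing obligation rather than doubt), which is one reason a trained classifier is preferable for the actual analysis.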
Before investigating our research questions, let's have a look at what our data looks like.
We see that there are 32 genders present in Wikidata. In this analysis, we will only focus on 2 genders: “Female” and “Male”.
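A minimal sketch of this restriction, assuming each record carries a `gender` label already resolved to a readable string. The toy records, the label spellings and the helper `keep_binary_genders` are assumptions for the example.

```python
from collections import Counter

# Labels retained for the analysis; the exact spellings are an assumption.
KEPT_GENDERS = {"female", "male"}

# Toy speaker records standing in for the Wikidata attributes.
records = [
    {"speaker": "A", "gender": "female"},
    {"speaker": "B", "gender": "male"},
    {"speaker": "C", "gender": "non-binary"},
    {"speaker": "D", "gender": None},
]


def keep_binary_genders(records):
    """Return only the records whose gender label is in KEPT_GENDERS."""
    return [record for record in records if record.get("gender") in KEPT_GENDERS]


# Overview of all labels before filtering, then the restricted set.
gender_counts = Counter(record["gender"] for record in records)
filtered = keep_binary_genders(records)

print(gender_counts)
print([record["speaker"] for record in filtered])
```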
We notice that the majority of the quotes are in English. Still, the dataset contains some non-English quotes. We remove these from our analysis and keep only the English ones.